


MultiG-Rank: Multiple graph regularized protein ranking 

Jing-Yan Wang* 1 

1 Mathematical and Computer Sciences and Engineering Division, King Abdullah University of Science and Technology, Thuwal, 
23955-6900, Saudi Arabia 

Email: Jing-Yan Wang - jingyan.wang@kaust.edu.sa; 
*Corresponding author

Abstract 

Background Protein ranking is a fundamental task in structural biology. Most protein ranking methods rely on
the pairwise comparison of proteins while neglecting the global manifold structure of the protein database.
Recently, graph regularized ranking has been proposed, which exploits the global structure of the graph defined by
these pairwise similarities. However, existing graph regularized ranking methods are very sensitive to
the choice of graph model and parameters, and making this choice remains a difficult problem.

Results To solve this problem, we have developed a Multiple Graph regularized Ranking algorithm, MultiG-Rank.
Instead of using a single graph to regularize the ranking scores, we approximate the intrinsic manifold of the
protein distribution by combining multiple initial graphs for the regularization. The graph weights are learned
jointly and automatically with the ranking scores, by alternately minimizing an objective function in an iterative
algorithm. Experimental results on the ASTRAL SCOP protein database demonstrate that MultiG-Rank
achieves better ranking performance than both single graph regularized ranking methods
and pairwise similarity based ranking methods.

Conclusion The problem of graph model and parameter selection in graph regularized protein ranking can be
effectively solved by combining multiple graphs. This generalization opens a new direction in
applying multiple graphs to protein ranking applications.

Background

Ranking and retrieving proteins from a protein database that contain a substructure similar to a query
protein is a critical task for analyzing protein structure, function, and evolution in structural biology
and bioinformatics [1,2]. The similar proteins discovered by the ranking system may help biologists
infer functional properties of the query from the returned proteins. For example, if a query protein of
unknown function retrieves a large number of database proteins that are enzymes, it is also likely to be an
enzyme [3].

The output of the ranking procedure is a list of database proteins ranked in descending order of their
similarity to the query. The choice of similarity measure largely determines
the performance of a ranking system, as argued by [1]. A large number of existing algorithms
compute this similarity as a ranking score:

Pairwise protein comparison computes the similarity between a pair of proteins by protein structure
alignment or protein feature comparison. Protein structure alignment based methods compare protein
structures at the level of residues, sometimes even atoms, to detect structural similarities with high
sensitivity and accuracy. For example, Carpentier et al. proposed YAKUSA [5], which compares
protein structures using a one-dimensional characterization based on protein backbone internal angles,
while Jung and Lee proposed SHEBA [6] for structural database scanning based on environmental
profiles. Protein feature based methods extract structure features and compute the similarity using a
similarity or distance function. For example, Zhang et al. used the 32-D tableau feature vector in
a comparison procedure called IR tableau [7], while Lee and Lee introduced a measure called WDAC
(Weighted Domain Architecture Comparison) for protein comparison [7]. Both use cosine
similarity for comparison purposes.

Graph based similarity learning Traditional protein comparison methods mentioned above focus on
detecting pairwise sequence alignments while neglecting all other proteins in the database and their
distribution. To tackle this problem, graph-based transductive similarity learning algorithms have been
proposed [3J2]. Instead of computing the similarity for a single pair of proteins, these
methods take advantage of the graph formed by the existing proteins. By collectively propagating the
similarity measures to the query protein and between the database proteins via graph transduction
(GT), we can learn a better metric for ranking.

The key component of graph based ranking is the construction of a graph as an estimate of the intrinsic
manifold of the database. As argued by Cai et al. [8], there are many ways to define graphs with
different models and parameters. However, up to now, there are no explicit rules for choosing the graph model
and parameters in general. In [3], the graph parameters are determined by a grid search over pairs of
parameters. In [8], several graph models are considered for graph regularization, and exhaustive experiments
are carried out for graph model and parameter selection. However, grid-search strategies select
parameters from discrete values in the parameter space and thus cannot closely approximate the optimal
solution. Cross-validation [9,10] can also be used for parameter selection, but it
does not always scale well to many graph parameters, and it may overfit the training
and validation sets while not generalizing well to the query set.

In [11], Geng et al. proposed an ensemble manifold regularization (EMR) framework, which combines
automatic intrinsic manifold approximation with semi-supervised learning (SSL) [12,13] of support vector
machines (SVM) [14,15]. Inspired by EMR, we address the problem of graph model and parameter
selection by fusing multiple graphs for ranking score learning in protein ranking. We first outline the graph
regularized ranking score learning framework, which optimizes the ranking scores under both relevance
and graph constraints, and then generalize it to the multiple graph case. By pre-computing a pool of
initial guesses of the graph Laplacian with different graph models and parameters, we combine them
linearly to approximate the intrinsic manifold. The optimal graph model(s) with optimal parameters are
selected by assigning them larger weights. Meanwhile, the ranking scores are
constrained to be smooth along the estimated graph. The graph weights and ranking scores are learned
jointly, leading to a unified objective function. The objective function is optimized alternately and conditionally
with respect to the graph weights and the ranking scores in an iterative algorithm. We name our Multiple
Graph regularized Ranking method MultiG-Rank; it is composed of an off-line graph weight
learning algorithm and an on-line ranking algorithm.

Method 

We are given a set of proteins represented by their Tableau 32-D feature vectors [7], $\mathcal{X} = \{x_1, x_2, \cdots, x_N\}$,
where $x_i \in \mathbb{R}^{32}$ is the Tableau feature vector of the $i$-th protein, $x_q$ is the query protein and the others are
database proteins. We define the ranking score vector as $\mathbf{f} = [f_1, f_2, \cdots, f_N]^\top \in \mathbb{R}^N$, in which $f_i$ is the ranking
score of $x_i$ to the query. The protein ranking problem is to rank the proteins in $\mathcal{X}$ according to the ranking
scores in descending order and return the top ones as ranking results, so that the returned proteins
are as relevant to the query as possible. In our work, we define two proteins as relevant if they belong to
the same SCOP fold type [16], and irrelevant otherwise. We denote the SCOP fold type labels of the proteins in
$\mathcal{X}$ as $\mathcal{C} = \{l_1, l_2, \cdots, l_N\}$, where $l_i$ is the label of the $i$-th protein. The optimal ranking scores of the relevant proteins
$\{x_i\}, l_i = l_q$ should be larger than those of the irrelevant ones $\{x_i\}, l_i \neq l_q$, so that the relevant proteins are
returned to the user.

Graph regularized protein ranking 

The protein ranking problem is to learn an optimal ranking score vector $\mathbf{f}$. We apply two constraints on $\mathbf{f}$ to
learn the optimal ranking scores:

Relevance constraint $\mathbf{f}$ should be consistent with the protein relevance to the query provided by the user, because
the query protein reflects the user's search intention. We define the relevance vector as
$\mathbf{y} = [y_1, y_2, \cdots, y_N]^\top \in \{1, 0\}^N$, where $y_i = 1$ if $x_i$ is relevant to the query and $y_i = 0$ otherwise. Since
the type label $l_q$ of a query protein $x_q$ is usually unknown, we only know that the query is relevant to
itself and have no prior knowledge of whether the other proteins are relevant to it, so we can only set
$y_q = 1$ for sure, while $y_i$, $i \neq q$, is unknown.

To assign different weights to different proteins in $\mathcal{X}$, we define a diagonal matrix $U$ with $U_{ii} = 1$ if $y_i$ is
known, and $U_{ii} = 0$ otherwise. To impose the relevance constraint on the learning of $\mathbf{f}$, we propose to
minimize the following objective function:

$$
\min_{\mathbf{f}} \; O^r(\mathbf{f}) = \sum_{i=1}^{N} (f_i - y_i)^2 U_{ii} = (\mathbf{f} - \mathbf{y})^\top U (\mathbf{f} - \mathbf{y}) \tag{1}
$$

Graph constraint $\mathbf{f}$ should also be consistent with the local distribution of the protein database. We embed
the local distribution into a $K$ nearest neighbor graph $\mathcal{G} = \{\mathcal{V}, \mathcal{E}, W\}$. For each protein $x_i$, its $K$
nearest neighbors, excluding itself, are denoted by $\mathcal{N}_i$. The node set $\mathcal{V}$ corresponds to the $N$ proteins in $\mathcal{X}$,
while $\mathcal{E}$ is the edge set, with $(i,j) \in \mathcal{E}$ if $x_j \in \mathcal{N}_i$ or $x_i \in \mathcal{N}_j$. The weight of an edge $(i,j)$ is denoted
by $W_{ij}$, which can be computed using different graph definitions and parameters, as described below.
The edge weights are organized in a weight matrix $W = [W_{ij}] \in \mathbb{R}^{N \times N}$, where $W_{ij}$ is the
weight of edge $(i,j)$. We expect that if two proteins $x_i$ and $x_j$ are close (i.e., $W_{ij}$ is large), $f_i$ and $f_j$ are
also close to each other. To impose the graph constraint on the learning of $\mathbf{f}$, we propose to minimize
the following objective function:

$$
\min_{\mathbf{f}} \; O^g(\mathbf{f}) = \frac{1}{2} \sum_{i,j=1}^{N} (f_i - f_j)^2 W_{ij} = \mathbf{f}^\top D \mathbf{f} - \mathbf{f}^\top W \mathbf{f} = \mathbf{f}^\top L \mathbf{f} \tag{2}
$$

where $D$ is a diagonal matrix whose entries are $D_{ii} = \sum_{j=1}^{N} W_{ij}$, and $L = D - W$ is the graph Laplacian
matrix.

By combining the two constraints, the learning of $\mathbf{f}$ is based on the minimization of the following objective
function:

$$
\min_{\mathbf{f}} \; O(\mathbf{f}) = O^r(\mathbf{f}) + \alpha O^g(\mathbf{f}) = (\mathbf{f} - \mathbf{y})^\top U (\mathbf{f} - \mathbf{y}) + \alpha \mathbf{f}^\top L \mathbf{f} \tag{3}
$$

where $\alpha$ is a trade-off parameter. The solution is obtained by setting the derivative of $O(\mathbf{f})$ with
respect to $\mathbf{f}$ to zero, which gives $\mathbf{f} = (U + \alpha L)^{-1} U \mathbf{y}$. In this way, we employ the information from both the query
protein provided by the user and the relationships among all the proteins in $\mathcal{X}$ to rank the proteins in $\mathcal{X}$. The query
information is embedded in $\mathbf{y}$ and $U$, while the protein relationship information is embedded in $L$. The final
ranking results are obtained by balancing the two sources of information. We call this method Graph
regularized Ranking (G-Rank) in this paper.
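
The closed-form solution can be illustrated with a minimal NumPy sketch. The function name `g_rank`, the toy chain graph and the choice of which labels are known are assumptions made for illustration only. The linear solve requires $U + \alpha L$ to be nonsingular, which holds here because the graph is connected and at least one label is known.

```python
import numpy as np

def g_rank(W, y, known, alpha=1.0):
    """Closed-form G-Rank scores f = (U + alpha * L)^{-1} U y."""
    W = np.asarray(W, dtype=float)
    U = np.diag(np.asarray(known, dtype=float))   # U_ii = 1 where y_i is known
    L = np.diag(W.sum(axis=1)) - W                # graph Laplacian L = D - W
    return np.linalg.solve(U + alpha * L, U @ np.asarray(y, dtype=float))

# Hypothetical toy data: a chain graph 0 - 1 - 2 - 3 with unit edge weights.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
y = np.array([1.0, 0.0, 0.0, 0.0])   # protein 0 relevant, protein 3 irrelevant
known = np.array([1, 0, 0, 1])       # only y_0 and y_3 are known
f = g_rank(W, y, known, alpha=1.0)
print(np.round(f, 3))                # scores decrease along the chain
```

In this toy case the scores interpolate between the two labelled ends of the chain, so proteins closer to the relevant one on the graph rank higher.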

Multiple graph learning and ranking: MultiG-Rank 

In this section, we propose multiple graph learning to directly learn a self-adaptive graph for ranking
regularization, in which the graph is assumed to be a linear combination of multiple predefined graphs (referred
to as base graphs). The graph weights are learned in a supervised way by considering the SCOP fold types of
the database proteins.

Multiple graph regularization 

The key component of graph regularization is the construction of the graph. There are many ways to find the
neighbors $\mathcal{N}_i$ of $x_i$ and to define the weight matrix $W$ on the graph, as described in [8]. We list several of
them as follows:

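• Heat kernel weighted graph: $\mathcal{N}_i$ of $x_i$ is found by comparing the squared Euclidean distance

$$
\|x_i - x_j\|^2 = x_i^\top x_i - 2 x_i^\top x_j + x_j^\top x_j \tag{4}
$$

and the weighting is computed using a heat kernel as

$$
W_{ij} = \begin{cases} e^{-\frac{\|x_i - x_j\|^2}{\sigma}}, & \text{if } (i,j) \in \mathcal{E} \\ 0, & \text{else} \end{cases} \tag{5}
$$

where $\sigma$ is the bandwidth of the kernel.

• Dot-product weighted graph: $\mathcal{N}_i$ of $x_i$ is found by comparing the squared Euclidean distance, and the
weighting is computed as the dot-product

$$
W_{ij} = \begin{cases} x_i^\top x_j, & \text{if } (i,j) \in \mathcal{E} \\ 0, & \text{else} \end{cases} \tag{6}
$$

• Cosine similarity weighted graph: $\mathcal{N}_i$ of $x_i$ is found by comparing the cosine similarity

$$
C(x_i, x_j) = \frac{x_i^\top x_j}{\|x_i\| \, \|x_j\|} \tag{7}
$$

and the weighting is also assigned as the cosine similarity

$$
W_{ij} = \begin{cases} C(x_i, x_j), & \text{if } (i,j) \in \mathcal{E} \\ 0, & \text{else} \end{cases} \tag{8}
$$

• Jaccard index weighted graph: $\mathcal{N}_i$ of $x_i$ is found by comparing the Jaccard index [17]

$$
J(x_i, x_j) = \frac{|x_i \cap x_j|}{|x_i \cup x_j|} \tag{9}
$$

and the weighting is also assigned as

$$
W_{ij} = \begin{cases} J(x_i, x_j), & \text{if } (i,j) \in \mathcal{E} \\ 0, & \text{else} \end{cases} \tag{10}
$$

• Tanimoto coefficient weighted graph: $\mathcal{N}_i$ of $x_i$ is found by comparing the Tanimoto coefficient

$$
T(x_i, x_j) = \frac{x_i^\top x_j}{\|x_i\|^2 + \|x_j\|^2 - x_i^\top x_j} \tag{11}
$$

and the weighting is also assigned as

$$
W_{ij} = \begin{cases} T(x_i, x_j), & \text{if } (i,j) \in \mathcal{E} \\ 0, & \text{else} \end{cases} \tag{12}
$$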
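
As a minimal sketch of how such base graphs might be built, the NumPy code below constructs symmetric $K$ nearest neighbor graphs with heat kernel, cosine and Tanimoto weights and forms the Laplacians $L = D - W$. The function names, the random stand-in features and the parameter values are illustrative assumptions, not part of the method description.

```python
import numpy as np

def knn_edges(S, k):
    """Symmetric kNN edge mask from a similarity matrix S (larger = closer)."""
    n = S.shape[0]
    S = np.array(S, dtype=float)             # work on a copy
    np.fill_diagonal(S, -np.inf)             # a protein is not its own neighbor
    idx = np.argsort(-S, axis=1)[:, :k]      # k most similar proteins per row
    E = np.zeros((n, n), dtype=bool)
    E[np.repeat(np.arange(n), k), idx.ravel()] = True
    return E | E.T                           # edge if x_j in N_i or x_i in N_j

def sq_dists(X):
    """Squared Euclidean distances via x_i.x_i - 2 x_i.x_j + x_j.x_j."""
    g = X @ X.T
    sq = np.diag(g)
    return np.maximum(sq[:, None] - 2.0 * g + sq[None, :], 0.0)

def heat_kernel_graph(X, k, sigma):
    d2 = sq_dists(X)
    return np.where(knn_edges(-d2, k), np.exp(-d2 / sigma), 0.0)

def cosine_graph(X, k):
    norms = np.linalg.norm(X, axis=1)
    C = (X @ X.T) / np.outer(norms, norms)
    return np.where(knn_edges(C, k), C, 0.0)

def tanimoto_graph(X, k):
    g = X @ X.T
    sq = np.diag(g)
    T = g / (sq[:, None] + sq[None, :] - g)
    return np.where(knn_edges(T, k), T, 0.0)

def laplacian(W):
    return np.diag(W.sum(axis=1)) - W

# Hypothetical stand-in for 32-D Tableau feature vectors of 6 proteins.
rng = np.random.default_rng(0)
X = rng.random((6, 32)) + 0.1
graphs = [heat_kernel_graph(X, k=2, sigma=1.0),
          cosine_graph(X, k=2),
          tanimoto_graph(X, k=2)]
laplacians = [laplacian(W) for W in graphs]
```

Each choice of weighting and of parameters such as `k` and `sigma` yields a different candidate Laplacian, which is the source of the model selection problem discussed next.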

Given so many choices of graphs, the most suitable graph and its parameters for the protein ranking
task are often unknown in advance, so an exhaustive search over a predefined pool of graphs is necessary.
However, when the pool becomes large, the exhaustive search becomes time-consuming and
sometimes infeasible. Hence, efficiently learning an appropriate graph that makes the performance of the
employed graph-based ranking method robust, or even improves it, is crucial for graph regularized ranking.
To tackle the aforementioned problems, we propose a multiple graph regularized ranking framework that
provides a series of initial guesses of the graph Laplacian and combines them to approximate the intrinsic
manifold in a conditionally optimal way, inspired by [11].

Given a set of $M$ graph candidates $\{\mathcal{G}_1, \cdots, \mathcal{G}_M\}$, we denote their corresponding candidate graph Laplacians
as $\mathcal{T} = \{L_1, \cdots, L_M\}$. By assuming that the optimal graph Laplacian lies in the convex hull of the
pre-given graph Laplacian candidates, we constrain the search space of possible graph Laplacians to linear
combinations of the $L_m$ in $\mathcal{T}$:

$$
L = \sum_{m=1}^{M} \mu_m L_m \tag{13}
$$

where $\mu_m$ is the weight of the $m$-th graph. To avoid negative contributions, we further constrain
$\sum_{m=1}^{M} \mu_m = 1$, $\mu_m \geq 0$.

To utilize the information from the data distribution approximated by the new composite graph Laplacian
$L$ in (13) for protein ranking, we introduce a new multiple graph regularization term. Substituting (13) into
(2), we have the augmented objective function term in an enlarged parameter space

$$
O^g(\mathbf{f}, \mu) = \mathbf{f}^\top \Big( \sum_{m=1}^{M} \mu_m L_m \Big) \mathbf{f} = \sum_{m=1}^{M} \mu_m \left( \mathbf{f}^\top L_m \mathbf{f} \right) \tag{14}
$$

where $\mu = [\mu_1, \cdots, \mu_M]^\top$ is the graph weight vector.
Off-line supervised multiple graph learning 

In the on-line querying procedure, the relevance of the query $x_q$ to the database proteins is unknown, so the
optimal graph weights $\mu$ cannot be learned in a supervised way. However, all the SCOP type labels of
the protein database are known, making supervised learning of $\mu$ in an off-line way possible. We treat each
database protein $x_q \in \mathcal{D}$, $q = 1, \cdots, N$ as a query in the off-line learning, and all the entries of its relevance
vector $\mathbf{y}_q = [y_{1q}, \cdots, y_{Nq}]^\top$ are known, since the labels of all database proteins are known:

$$
y_{iq} = \begin{cases} 1, & \text{if } l_i = l_q \\ 0, & \text{else} \end{cases} \tag{15}
$$

Therefore, we set $U = I^{N \times N}$ as an $N \times N$ identity matrix. The ranking score vector of the $q$-th database protein
is defined as $\mathbf{f}_q = [f_{1q}, \cdots, f_{Nq}]^\top$. Substituting $\mathbf{f}_q$, $\mathbf{y}_q$ and $U$ into (1) and (14) and combining them, we
have the optimization problem for the $q$-th database protein as follows:

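$$
\begin{aligned}
\min_{\mathbf{f}_q, \mu} \; & O(\mathbf{f}_q, \mu) = (\mathbf{f}_q - \mathbf{y}_q)^\top (\mathbf{f}_q - \mathbf{y}_q) + \alpha \sum_{m=1}^{M} \mu_m \left( \mathbf{f}_q^\top L_m \mathbf{f}_q \right) + \beta \|\mu\|^2 \\
\text{s.t.} \; & \sum_{m=1}^{M} \mu_m = 1, \; \mu_m \geq 0.
\end{aligned} \tag{16}
$$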

To prevent the weight vector $\mu$ from overfitting to a single graph, we also introduce the $\ell_2$ norm regularization term
$\|\mu\|^2$ into the objective function. Note the difference between $\mathbf{y}_q$ and $\mathbf{f}_q$: $\mathbf{y}_q \in \{1, 0\}^N$
plays the role of the given ground truth in the supervised learning procedure, while $\mathbf{f}_q \in \mathbb{R}^N$ is the variable to
be solved. $\mathbf{y}_q$ is the ideal solution for $\mathbf{f}_q$, although it may not be attained exactly after learning. In (16), the
first term keeps $\mathbf{f}_q$ as similar to $\mathbf{y}_q$ as possible during the learning procedure.

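Objective function: Using all proteins in the database, $q = 1, \cdots, N$, as queries to learn $\mu$, we obtain the final
objective function of the supervised multiple graph weighting and protein ranking problem:

$$
\begin{aligned}
\min_{F, \mu} \; O(F, \mu) &= \sum_{q=1}^{N} \left[ (\mathbf{f}_q - \mathbf{y}_q)^\top (\mathbf{f}_q - \mathbf{y}_q) + \alpha \sum_{m=1}^{M} \mu_m \left( \mathbf{f}_q^\top L_m \mathbf{f}_q \right) \right] + \beta \|\mu\|^2 \\
&= \mathrm{Tr}\left[ (F - Y)^\top (F - Y) \right] + \alpha \sum_{m=1}^{M} \mu_m \mathrm{Tr}\left( F^\top L_m F \right) + \beta \|\mu\|^2 \\
\text{s.t.} \; & \sum_{m=1}^{M} \mu_m = 1, \; \mu_m \geq 0.
\end{aligned} \tag{17}
$$

where $F = [\mathbf{f}_1, \cdots, \mathbf{f}_N]$ is the ranking score matrix with its $q$-th column as the ranking score vector of the $q$-th
protein, and $Y = [\mathbf{y}_1, \cdots, \mathbf{y}_N]$ is the relevance matrix with its $q$-th column as the relevance vector of the $q$-th
protein.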

Optimization: Since direct optimization of (17) is difficult, we instead adopt an iterative, two-step
strategy to alternately optimize $F$ and $\mu$. At each iteration, one of $F$ and $\mu$ is optimized while the other
is fixed, and then the roles of $F$ and $\mu$ are switched. Iterations are repeated until a maximum number of
iterations is reached.

• On optimizing $F$: With the graph weights $\mu$ fixed, the analytic solution of problem (17) is
obtained by setting the derivative of $O(F, \mu)$ with respect to $F$ to zero. That is,

$$
\begin{aligned}
\frac{\partial O(F, \mu)}{\partial F} &= 2 (F - Y) + 2 \alpha \sum_{m=1}^{M} \mu_m (L_m F) = 0 \\
\Rightarrow \; F &= \Big( I + \alpha \sum_{m=1}^{M} \mu_m L_m \Big)^{-1} Y
\end{aligned} \tag{18}
$$

• On optimizing $\mu$: By fixing $F$ and removing the terms irrelevant to $\mu$ from (17), the optimization problem
(17) reduces to

$$
\begin{aligned}
\min_{\mu} \; & \alpha \sum_{m=1}^{M} \mu_m \mathrm{Tr}\left( F^\top L_m F \right) + \beta \|\mu\|^2
 = \alpha \sum_{m=1}^{M} \mu_m e_m + \beta \sum_{m=1}^{M} \mu_m^2
 = \alpha \mathbf{e}^\top \mu + \beta \mu^\top \mu \\
\text{s.t.} \; & \sum_{m=1}^{M} \mu_m = 1, \; \mu_m \geq 0.
\end{aligned} \tag{19}
$$

where $e_m = \mathrm{Tr}(F^\top L_m F)$ and $\mathbf{e} = [e_1, \cdots, e_M]^\top$. The optimization of (19) with respect to the graph
weights $\mu$ can be solved as a standard quadratic programming (QP) problem [1].
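
Because the quadratic term in this problem is isotropic, the QP admits a simple reading that may help build intuition. The following is a standard completing-the-square argument, assuming $\beta > 0$, rather than a step taken from the text; the symbols $\Delta$ (the weight simplex) and $\theta$ (a threshold fixed by the sum constraint) are introduced here for illustration.

$$
\begin{aligned}
\alpha \mathbf{e}^\top \mu + \beta \mu^\top \mu
  &= \beta \left\| \mu + \tfrac{\alpha}{2\beta} \mathbf{e} \right\|^2 - \tfrac{\alpha^2}{4\beta} \mathbf{e}^\top \mathbf{e}, \\
\mu^\star &= \arg\min_{\mu \in \Delta} \left\| \mu - \left( -\tfrac{\alpha}{2\beta} \mathbf{e} \right) \right\|^2,
  \qquad \Delta = \Big\{ \mu : \sum_{m=1}^{M} \mu_m = 1, \ \mu_m \geq 0 \Big\}, \\
\mu_m^\star &= \max\Big( 0, \ \theta - \tfrac{\alpha}{2\beta} e_m \Big),
  \qquad \theta \text{ chosen so that } \sum_{m=1}^{M} \mu_m^\star = 1 .
\end{aligned}
$$

Under this reading, graphs with a smaller smoothness cost $e_m$ receive larger weights. As $\beta$ grows, the weights approach the uniform value $1/M$; as $\beta \to 0$, the weight concentrates on the graph with the smallest $e_m$.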

Off-line algorithm: We summarize the off-line $\mu$ learning algorithm in Algorithm 1.



Algorithm 1 MultiG-Rank: off-line graph weights learning algorithm.
Require: Candidate graph Laplacian set $\mathcal{T}$;
Require: SCOP type label set of database proteins $\mathcal{C}$;
Require: Maximum iteration number $T$;

Construct the relevance matrix $Y = [y_{iq}]^{N \times N}$, where $y_{iq} = 1$ if $l_i = l_q$, and $0$ otherwise;

Initialize the graph weights as $\mu_m^0 = \frac{1}{M}$, $m = 1, \cdots, M$;

for $t = 1, \cdots, T$ do

    Update the ranking score matrix $F^t$ according to the previous $\mu^{t-1}$ by (18);
    Update the graph weights $\mu^t$ according to the updated $F^t$ by (19);

end for

Output graph weights $\mu = \mu^{T}$.
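
The alternating procedure of Algorithm 1 can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: instead of a general QP solver, the $\mu$ update uses a sort-based Euclidean projection onto the simplex, which solves the same problem when $\beta > 0$. The function names, the toy Laplacians and the parameter values are hypothetical.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto {mu : mu >= 0, sum(mu) = 1} (sort-based)."""
    u = np.sort(v)[::-1]                      # entries in descending order
    css = np.cumsum(u) - 1.0
    ks = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / ks > 0)[0][-1]
    theta = css[rho] / (rho + 1)
    return np.maximum(v - theta, 0.0)

def learn_graph_weights(laplacians, labels, alpha=1.0, beta=1.0, n_iter=10):
    """Alternate the F update (18) and the mu update (19) for n_iter rounds."""
    labels = np.asarray(labels)
    Y = (labels[:, None] == labels[None, :]).astype(float)   # relevance matrix
    M, N = len(laplacians), len(labels)
    mu = np.full(M, 1.0 / M)                                  # uniform start
    for _ in range(n_iter):
        L = sum(m * Lm for m, Lm in zip(mu, laplacians))
        F = np.linalg.solve(np.eye(N) + alpha * L, Y)         # F update
        e = np.array([np.trace(F.T @ Lm @ F) for Lm in laplacians])
        mu = project_simplex(-alpha * e / (2.0 * beta))       # mu update
    return mu, F

def lap(W):
    W = np.asarray(W, dtype=float)
    return np.diag(W.sum(axis=1)) - W

# Hypothetical toy data: two folds, {0, 1} and {2, 3}.
labels = [0, 0, 1, 1]
L_within = lap([[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]])
L_across = lap([[0, 0, 1, 0], [0, 0, 0, 1], [1, 0, 0, 0], [0, 1, 0, 0]])
mu, F = learn_graph_weights([L_within, L_across], labels)
print(mu)   # the graph linking proteins of the same fold gets the larger weight
```

In this toy example the graph whose edges connect proteins of the same fold incurs no smoothness cost on the label matrix, so it receives essentially all of the weight.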



On-line ranking regularized by multiple graphs 

Given a newly discovered protein submitted by a user as query $x_0$, its SCOP type label $l_0$ is unknown and it
is not in the database $\mathcal{D} = \{x_1, \cdots, x_N\}$. To compute the ranking scores of $x_i \in \mathcal{D}$ to the query $x_0$, we extend
the database to size $N + 1$ by adding $x_0$ and solve for the ranking score vector of $x_0$, which is
defined as $\mathbf{f} = [f_0, \cdots, f_N]^\top \in \mathbb{R}^{N+1}$, by (3). The parameters in (3) are constructed as follows:

• Laplacian matrix $L$: We first compute the $M$ graph weight matrices $\{W_m\}_{m=1}^{M} \in \mathbb{R}^{(N+1) \times (N+1)}$
with their corresponding Laplacian matrices $\{L_m\}_{m=1}^{M} \in \mathbb{R}^{(N+1) \times (N+1)}$ for the extended database
$\{x_0, x_1, \cdots, x_N\}$. Then, with the graph weights $\mu$ learned by Algorithm 1, the new Laplacian matrix $L$
can be computed as in (13).

On-line graph weight computation: When a new query $x_0$ comes in, we calculate its $K$ nearest neighbors
in the database $\mathcal{D}$ and the corresponding weights $W_{0j}$ and $W_{j0}$, $j = 1, \cdots, N$. We assume that adding
this new query to the database does not affect the graph in the database space, so the neighbors
and weights $W_{ij}$, $i, j = 1, \cdots, N$ for the database proteins are fixed and can be pre-computed off-line.
In this way, we only need to compute $N$ edge weights for each graph instead of $(N + 1) \times (N + 1)$.

• Relevance vector $\mathbf{y}$: The relevance vector for $x_0$ is defined as $\mathbf{y} = [y_0, \cdots, y_N]^\top \in \{1, 0\}^{N+1}$, with
only $y_0 = 1$ known, while the other $y_i$, $i = 1, \cdots, N$, are unknown.

• Matrix $U$: In this situation, $U$ is an $(N + 1) \times (N + 1)$ diagonal matrix with $U_{00} = 1$ and $U_{ii} = 0$,
$i = 1, \cdots, N$.

Then the ranking score vector $\mathbf{f}$ can be solved as

$$
\mathbf{f} = (U + \alpha L)^{-1} U \mathbf{y} \tag{20}
$$

We summarize the on-line ranking algorithm in Algorithm 2.

Algorithm 2 MultiG-Rank: on-line ranking algorithm.

Require: Protein database $\mathcal{D} = \{x_1, \cdots, x_N\}$;
Require: Query protein $x_0$;
Require: Graph weights $\mu$;

Extend the database to size $(N + 1)$ by adding $x_0$ and compute the $M$ graph Laplacians of the extended
database;
Obtain the multiple graph Laplacian $L$ by linear combination of the $M$ graph Laplacians with weights $\mu$ as in (13);
Construct the relevance vector $\mathbf{y} \in \mathbb{R}^{N+1}$ where $y_0 = 1$, and the diagonal matrix $U \in \mathbb{R}^{(N+1) \times (N+1)}$ with
$U_{ii} = 1$ if $i = 0$ and $0$ otherwise;
Solve the ranking vector $\mathbf{f}$ for $x_0$ as in (20);
Rank the proteins in $\mathcal{D}$ according to the ranking scores $\mathbf{f}$ in descending order.
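
A minimal NumPy sketch of the on-line steps is given below. One assumption deserves attention: when $U_{ii}$ is exactly $0$ for every database protein and the combined graph is connected, the constant vector satisfies $(U + \alpha L)\mathbf{f} = U\mathbf{y}$ because $L\mathbf{1} = 0$, so all scores tie. The sketch therefore exposes the weight of the unknown entries as a hypothetical parameter `u_unknown`; a small positive value pulls unlabelled scores toward $y_i = 0$ and breaks the tie. This parameter, the function name and the toy graph are assumptions of the sketch and are not specified in the text.

```python
import numpy as np

def online_rank(laplacians_ext, mu, alpha=1.0, u_unknown=1e-2):
    """Rank database proteins 1..N against the query stored at index 0.

    laplacians_ext: M Laplacians of the extended (N+1)-protein database.
    u_unknown: diagonal weight of U for unlabelled proteins (an assumption).
    """
    n = laplacians_ext[0].shape[0]
    L = sum(m * Lm for m, Lm in zip(mu, laplacians_ext))   # weighted combination
    y = np.zeros(n)
    y[0] = 1.0                                             # the query is relevant to itself
    u = np.full(n, float(u_unknown))
    u[0] = 1.0
    U = np.diag(u)
    f = np.linalg.solve(U + alpha * L, U @ y)              # f = (U + alpha L)^{-1} U y
    order = 1 + np.argsort(-f[1:], kind="stable")          # database indices, best first
    return f, order

# Hypothetical toy data: query 0 attached to a chain of database proteins 1 - 2 - 3.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L_chain = np.diag(W.sum(axis=1)) - W
f, order = online_rank([L_chain], mu=[1.0])
print(order)   # proteins closer to the query on the graph come first
```

Only the row and column of the query change between queries, so in practice the database part of each Laplacian can be cached and reused.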



Experiments 

Protein database and query set 

We use the SCOP 1.75A database [18] to construct our database and query set. The SCOP 1.75A database
contains 49,219 protein PDB entries and 135,643 domains, belonging to 7 classes and 1,194 SCOP fold
types.

Protein database 

Our database is selected from the ASTRAL SCOP 1.75A set [18]. ASTRAL is a compendium providing
databases and tools for analyzing protein structures and their sequences; it is partially derived from, and
augments, the SCOP (Structural Classification of Proteins) database. Its current version is 1.75A, released
on March 15, 2012 [18]. A subset of the SCOP 1.75A database, the ASTRAL SCOP 1.75A genetic domain
sequence subset (ASTRAL SCOP 1.75A 40%) [18], is used as our database $\mathcal{D}$. This subset is selected from the
SCOP 1.75A database so that the selected domains have less than 40% identity to each other. In total, there
are 11,212 protein domains in the ASTRAL SCOP 1.75A 40% database, belonging to 1,196 SCOP
fold types, which are available on-line at http://scop.berkeley.edu. The distribution of protein numbers over
the different fold types is shown in Fig. 1. Note that many previous works in this field evaluated
the ranking performance on an older version of the ASTRAL SCOP dataset (ASTRAL SCOP 1.73 95%) released
in 2008 [7]. Since a new version was released in 2012 (ASTRAL SCOP 1.75A 40%), we choose to use
the new version in our experiments.

Figure 1: Distribution of protein numbers of different fold types in ASTRAL SCOP 1.75A 40% database.



Query set 

We also select 540 protein domains from the SCOP 1.75A database to construct a query set. Each query protein
domain has at least one protein domain belonging to the same SCOP fold type in the ASTRAL SCOP
1.75A 40% database, so that for each query there is at least one "positive" sample in the database.
We refer to our database and query set as the 540 query dataset, since it contains 540 protein domains
from the SCOP 1.75A database.

Evaluation metrics 

We run a ranking procedure using a query, which returns a list of all database proteins along with their ranking
scores to the query. We adopt the same evaluation framework as [7], and use the receiver operating
characteristic (ROC) curve, the area under the ROC curve (AUC) and the recall-precision curve to evaluate
the ranking accuracy. Given a query protein $x_q$ belonging to the SCOP fold $l_q$, a list of proteins is
returned from the database by the on-line MultiG-Rank algorithm or another ranking method. For a database
protein $x_r$ in the returned list, if its fold label $l_r$ is the same as that of $x_q$, i.e. $l_r = l_q$, it is identified
as a true positive (TP); otherwise it is identified as a false positive (FP). For a database protein $x_{r'}$ not
in the returned list, if its fold label $l_{r'} \neq l_q$, it is identified as a true negative (TN); otherwise it is a false
negative (FN). We can then compute the true positive rate (TPR), false positive rate (FPR), recall
and precision based on the above statistics as follows:

$$
\mathrm{TPR} = \frac{TP}{TP + FN}, \quad \mathrm{FPR} = \frac{FP}{FP + TN}, \quad
\mathrm{recall} = \frac{TP}{TP + FN}, \quad \mathrm{precision} = \frac{TP}{TP + FP} \tag{21}
$$

By varying the length of the returned list, we obtain different TPR, FPR, recall and precision values.

ROC curve Using FPR as the abscissa and TPR as the ordinate, we can plot the ROC curve. For a 
high-performance ranking system, the ROC curve should be close to the top-left corner. 

Recall-Precision curve Using recall as the abscissa and precision as the ordinate, the recall-precision curve
can be plotted. For a high-performance ranking system, this curve should be close to the top-right
corner of the plot.

AUC The AUC is also computed as a single-figure measurement for the quality of an ROC curve. We 
average AUC over all the queries to evaluate the performances of different ranking methods. 
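
The metrics above can be computed for a single query with a short NumPy sketch. The function names and the toy scores are assumptions made for illustration; the sketch also assumes that the database contains at least one relevant and one irrelevant protein, so that no denominator is zero, and it integrates the ROC curve with the trapezoidal rule.

```python
import numpy as np

def ranking_curves(scores, relevant):
    """TPR, FPR, recall and precision for every returned-list length 1..N."""
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    rel = np.asarray(relevant, dtype=bool)[order]   # relevance in ranked order
    tp = np.cumsum(rel)                  # relevant proteins inside the list
    fp = np.cumsum(~rel)                 # irrelevant proteins inside the list
    fn = rel.sum() - tp                  # relevant proteins left outside
    tn = (~rel).sum() - fp               # irrelevant proteins left outside
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    precision = tp / (tp + fp)
    return tpr, fpr, tpr.copy(), precision   # recall coincides with TPR

def auc(fpr, tpr):
    """Area under the ROC curve (trapezoidal rule), starting from (0, 0)."""
    x = np.concatenate(([0.0], fpr))
    y = np.concatenate(([0.0], tpr))
    return float(np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2.0))

# Hypothetical ranking scores of five database proteins for one query.
scores = [0.9, 0.8, 0.7, 0.6, 0.5]
relevant = [1, 0, 1, 0, 0]               # 1 = same SCOP fold as the query
tpr, fpr, recall, precision = ranking_curves(scores, relevant)
auc_value = auc(fpr, tpr)
print(round(auc_value, 4))
```

Averaging `auc_value` over all queries gives the single-figure summary used in the tables below.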

Results 

In this section, we first compare our MultiG-Rank against several popular graph based ranking score learning
methods for ranking protein domains represented by Tableau features. Then, we also compare the ranking
performance of MultiG-Rank with that of other protein ranking methods using different protein comparison strategies.

Comparison against other graph based ranking methods 

We compare our MultiG-Rank to two graph based ranking methods, G-Rank and GT [3]. Cosine
similarity is used as a baseline pairwise similarity in this experiment, marked as "Pairwise Rank" in the figure.
The evaluations are conducted with the 540 queries of the 540 query set. The average ranking performance is
computed over these 540 query runs.

Figure 2: ROC curve and recall-precision curve of different graph-based ranking methods. (a) ROC curve (horizontal axis: FPR); (b) recall-precision curve (horizontal axis: recall).



Fig. 2 shows the ROC curves and recall-precision curves obtained with different graph ranking methods. Each
curve in the figures represents a graph based ranking score learning algorithm. As can be seen, our MultiG-Rank
algorithm outperforms the other graph based ranking algorithms. As shown in Fig. 2
(b), the precision difference grows as the recall value increases and then tends to converge to zero. The
G-Rank algorithm outperforms GT in most cases. Moreover, both G-Rank and GT perform better than
pairwise ranking, which neglects the global distribution of the protein database.

Table 1: AUC results of different graph-based ranking methods.

Method          AUC
MultiG-Rank     0.9730
G-Rank          0.9575
GT              0.9520
Pairwise-Rank   0.9478



Table 1 tabulates the AUC results of the different graph-based methods on the 540 query set. The best AUC
result of each method is recorded for a fair comparison. As shown in this table, the proposed MultiG-Rank
outperforms the three compared methods on our database, with AUC gains
of 0.0155, 0.0210 and 0.0252 over G-Rank, GT and Pairwise-Rank, respectively.
The ranking accuracy is thus improved by using our algorithm. The two single graph
based ranking methods, GT and G-Rank, achieve similar AUC values (about 0.95), and both
are outperformed by MultiG-Rank by about 0.02.

We make three observations from the results listed in Table 1:

1. G-Rank and GT obtain similar performance on the database, indicating that there is no significant difference
in performance between graph transduction and graph regularization for unsupervised
learning of the ranking scores using a single graph.

2. Pairwise ranking obtains the worst performance even though it uses a carefully selected similarity
function, as reported in [7]. The reason is that the similarity computed by pairwise ranking focuses only
on detecting statistically significant pairwise differences while missing more subtle sequence similarity.
Hence, the variance among different fold types cannot be accurately estimated when
the global distribution is neglected and only the compared protein pairs are considered. Another possible
reason is that pairwise ranking usually obtains better ranking performance when the number of proteins
in the database is small. When the number of proteins is large, as in our database, the ranking
performance of pairwise ranking is poor.

3. MultiG-Rank obtains the best ranking performance, which implies that both the discriminant and the geometrical
information of the protein database are important for accurate ranking. The geometrical information
is estimated by multiple graphs. The discriminant information is included when the graph weights are
learned with the help of SCOP fold type labels in our algorithm.

Comparison with other protein ranking methods 

In this experiment, we compare our MultiG-Rank against several popular protein ranking methods: IR
Tableau [7], QP tableau [1], YAKUSA [5], and SHEBA [6]. For the purpose of comparison, we first consider
different methods for protein-protein comparison to compute the similarity or dissimilarity. The ordering
technique is devised to detect hits by taking the similarities between data pairs as input. For our MultiG-Rank,
the ranking score plays the role of protein-protein similarity. The AUC values are reported in Table 2.

Table 2: AUC results for different protein ranking methods.

Method          AUC
MultiG-Rank     0.9730
IR Tableau      0.9478
YAKUSA          0.9537
SHEBA           0.9421
QP tableau      0.9364

It can be observed from Table 2 that, with the advantage of exploring data characteristics from various
graphs, MultiG-Rank achieves clear improvements in the ranking outcomes: the AUC increases from
0.9478 to 0.9730 using the same Tableau feature as IR Tableau. It also outperforms QP Tableau, SHEBA,
and YAKUSA, improving the AUC from 0.9364, 0.9421 and 0.9537, respectively, to 0.9730. Furthermore, owing to its
better use of effective protein descriptors, IR Tableau also outperforms QP Tableau.

To evaluate the effect of using protein descriptors for ranking instead of direct protein structure
comparison, we compare IR Tableau with YAKUSA and SHEBA. The main difference among them is that IR
Tableau considers both protein feature extraction and comparison procedures, while YAKUSA and SHEBA
only compare a pair of proteins directly. The quantitative results in Table 2 show that IR Tableau does not
outperform YAKUSA (0.9478 versus 0.9537) despite the additional information from the protein descriptor. This
suggests that the ranking performance improvement is mainly achieved by the graph regularization
rather than by the power of the protein descriptor.

Conclusion 

The proposed MultiG-Rank introduces a new paradigm for strengthening a broad range of existing graph based
ranking techniques. Its main advantage lies in its ability to learn a unified space of ranking scores for a
protein database from multiple graph representations. Such flexibility is important in tackling complicated
protein ranking problems, and allows one to exploit more prior knowledge for effectively analyzing a given
protein database, including choosing a proper set of graphs to better characterize the manifold of the database,
and adopting a multiple graph-based ranking method to appropriately model the relationships among the
proteins. Throughout this work, MultiG-Rank has been evaluated on a carefully selected
subset of the ASTRAL SCOP 1.75A protein database. The promising experimental results support
the usefulness of our ranking score learning approach. Moreover, MultiG-Rank can also be applied to other
problems in bioinformatics [19-24], medical imaging [25-28], biometrics [29-34] and computer vision [35-38].